-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update ARO operator Azure auth scheme to use a DefaultAzureCredential #3274
Update ARO operator Azure auth scheme to use a DefaultAzureCredential #3274
Conversation
08dc156
to
b80a782
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass, looks great! I just have a question about the continued use of an old function.
f9405b6
to
ad91451
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work - just a quick first pass on this
pkg/util/kubernetes/deployments.go
Outdated
d.Spec.Template.ObjectMeta.Annotations["kubectl.kubernetes.io/restartedAt"] = time.Now().Format(time.RFC3339) | ||
|
||
_, err = cli.Update(ctx, d, metav1.UpdateOptions{}) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit - no new line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we patch this? We've been having issues with running updates instead of patches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to making this a patch call, especially considering we just want to ensure an annotation is added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I became aware while changing this over to a patch call that there are multiple ways to patch. After looking at the different options, I went with server-side apply. Let me know if we should do it a different way though.
@@ -201,6 +201,8 @@ func (m *manager) Update(ctx context.Context) error { | |||
steps.Action(m.renewMDSDCertificate), | |||
steps.Action(m.updateOpenShiftSecret), | |||
steps.Action(m.updateAROSecret), | |||
steps.Action(m.restartAROOperatorMaster), // depends on m.updateOpenShiftSecret; the point of restarting is to pick up any changes made to the secret |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to wait for the reconciliation of the credentialsRequest before we restart it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. I think we want to, but I'll have to think about how to check that the reconciliation is done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did somewhat of a deep dive while figuring out how to go about this. The CredentialsRequest's status has a lastSyncCloudCredsSecretResourceVersion
that we could check, but it feels hacky because we would have to save the old version prior to updating the secret so that we can see what it used to be. I'm thinking we can just use the lastSyncTimestamp
instead by checking to see if it's something within the past 5 minutes. Your thoughts?
It's also tricky because CredentialsRequests haven't made their way into openshift/client-go. Based on my reading, I'll have to use https://pkg.go.dev/k8s.io/client-go/dynamic#DynamicClient to work with the CredentialsRequest from within pkg/cluster
, but let me know if there's a better way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed a solution that uses the lastSyncTimestamp
, and I feel like this should be fine.
@@ -22,18 +22,14 @@ type servicePrincipalChecker interface { | |||
type checker struct { | |||
log *logrus.Entry | |||
|
|||
credentials func(ctx context.Context) (*clusterauthorizer.Credentials, error) | |||
getTokenCredential func(azEnv *azureclient.AROEnvironment, credentials *clusterauthorizer.Credentials) (azcore.TokenCredential, error) | |||
getTokenCredential func(azEnv *azureclient.AROEnvironment) (azcore.TokenCredential, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the purpose of this as a struct? So that the controller can fetch a new token credential every time it needs to run the checker, rather than once on startup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also wondering if this is "legacy". The newer azure-sdk-for-go no longer requires us to refresh the token iirc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the purpose of this as a struct? So that the controller can fetch a new token credential every time it needs to run the checker, rather than once on startup?
I'm not 100% sure, but it seems that way to me.
I'm also wondering if this is "legacy". The newer azure-sdk-for-go no longer requires us to refresh the token iirc.
I wasn't aware of this. I'll have to look into it a bit, and then maybe we can remove unneeded stuff from the struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some digging, I found that you're right - as long as we're using azidentity, our tokens will get auto-refreshed before they expire. So I'll see about refactoring this struct a bit.
For context, #2667 switched us over to azidentity
, which uses MSAL under the hood, and https://learn.microsoft.com/en-us/entra/identity-platform/msal-overview states that MSAL refreshes tokens automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, after looking at this code in more detail and considering how we might refactor it, I think we might as well leave it how it is.
It looks like for all of the checkers under pkg/operator/controllers/checkers
, there's a pattern of defining a SomethingChecker
interface and then creating a struct type that implements the interface. So it logically makes sense to have that struct type encapsulate anything related to the Check
that the controller in question does. I think getTokenCredential
and newSPValidator
are included in the struct more because its logical than because of a goal of getting a new token each time the controller reconciles (which is what I was originally thinking).
Additionally, if you look at the storage accounts controller for example (specifically
authorizer, err := azRefreshAuthorizer.NewRefreshableAuthorizerToken(ctx) |
Reconcile
s. So if we wanted to refactor to only create a token credential on controller startup, we'd probably want to do that to all of the controllers that leverage Azure clients and not just the service principal checker.
Let me know what you think!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first pass lgtm, some comments below. If you haven't already done so, please rebase.
pkg/util/kubernetes/deployments.go
Outdated
d.Spec.Template.ObjectMeta.Annotations["kubectl.kubernetes.io/restartedAt"] = time.Now().Format(time.RFC3339) | ||
|
||
_, err = cli.Update(ctx, d, metav1.UpdateOptions{}) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to making this a patch call, especially considering we just want to ensure an annotation is added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, other than some additional test coverage if possible. If that's too much for this PR then maybe in a follow up PR.
Thanks for the neat commit structure.
ad91451
to
474d638
Compare
This ensures that the ARO operator will be working with the latest azure-cloud-credentials Secret content.
- Removed a test from pkg/util/clusterauthorizer; there's no longer anything to test since we don't have direct access to the Azure credential data when using a DefaultAzure Credential - Fixed misc. unit tests to reflect changes to GetTokenCredential in pkg/util/clusterauthorizer - Simplified checking logic in certain places - Renamed an import to make the linter happy
- Removed now unused AzCredentials function - Changed ARO operator deployment wait time during `az aro update` from 20 minutes -> 5 minutes
- Updated Restart in pkg/util/kubernetes to use server-side apply - Updated Restart in pkg/operator/deploy to only return an error after at least attempting to restart all of the deployments passed in - Unit tests for Restart in pkg/util/kubernetes - E2E test for ARO operator master deployment's restart upon cluster update
cc35b92
to
a5f5076
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding test coverage, lgtm.
// ServicePrincipalProfile to the contents of the kube-system/azure-cloud-provider Secret. If the CSP | ||
// has changed, it returns true along with a new corev1.Secret to use to update the Secret to match | ||
// what's in the cluster doc. | ||
func (m *manager) servicePrincipalUpdated(ctx context.Context) (bool, *corev1.Secret, error) { | ||
var changed bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if possible I'd rather see the code that is in the if changed
block extracted to its own helper function and used along with a return
when we need to set changed=true
. this avoids the use of a tracking boolean and reduces complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I see what you mean. I've pushed a commit that I believe implements this the way you have in mind.
/azp run ci |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tested both a new install and PUCM with these changes to confirm behavior looks correct - lgtm, thanks for addressing all of the changes!
…Azure#3274) * Update the cluster authorizer to use a DefaultAzureCredential * Update the ARO operator to set and use DefaultAzureCredential via env vars * Add a CredentialsRequest to the ARO operator deployment * Restart the ARO operator upon `az aro update` * Removed now unused AzCredentials function * Changed ARO operator deployment wait time during `az aro update` from 20 minutes -> 5 minutes * Refactor CliWithApply to generalize to different object types * Updated Restart in pkg/util/kubernetes to use server-side apply * Updated Restart in pkg/operator/deploy to only return an error after at least attempting to restart all of the deployments passed in * E2E test for ARO operator master deployment's restart upon cluster update * Wait for the ARO operator's CredentialsRequest to be reconciled before restarting
Which issue this PR addresses:
What this PR does / why we need it:
This PR updates the ARO operator to authenticate to Azure using a DefaultAzureCredential, which will work with both a CSP and WI.
To get it working with a CSP, I had to make a few other changes:
az aro update
process will also now verify that the CredentialsRequest mentioned in the bullet above has been reconciled (and thus the Secret has been updated) before doing this restart.Note that to get the ARO operator working on an ARO WI cluster later on, we'll need a follow-up PR to have the operator Pod 1) use
AZURE_FEDERATED_TOKEN_FILE
instead ofAZURE_CLIENT_SECRET
and 2) mount a bound service account token (more info in testing steps below because I had to do it manually; the testing steps could also be a useful reference for the engineer who makes the follow-up PR).Test plan for issue:
There were a few different things to test:
az aro update --refresh-credentials
az aro update --refresh-credentials
against my dev cluster and observed the ARO operator master Pod get replaced during the Deployment rollout.ccoctl azure create-all
, I copied the ARO operator's CredentialsRequest manifest into the--credentials-request-dir
so that ccoctl would create a WI and a Secret manifest for the ARO operator. I also removed this Secret manifest from the install manifest directory before installing the cluster and saved the manifest so I could apply it myself later.pkg/operator/deploy
and the Secret manifest that I had saved during step 1. I did make a few changes to the Deployment manifest: 1) replacedAZURE_CLIENT_SECRET
withAZURE_FEDERATED_TOKEN_FILE
and 2) mounted a bound service account token (see [2] for the things to add).[1] https://github.com/openshift/cloud-credential-operator/blob/master/docs/azure_workload_identity.md
[2] Two things to add in the ARO operator Deployment spec to mount the bound service account token:
Is there any documentation that needs to be updated for this PR?
No